What are Anomalies?
Get an introduction to anomalies in a dataset, and understand how the mean and standard deviation are used to identify them.
Introduction
An anomaly in a data series is a significant deviation from some reasonable value. For example, look at this series of numbers. Which number stands out?
The number that stands out in this series is 12.
This is intuitive to a human, but computer programs do not have that intuition…
Mathematical foundation
To find the anomaly in the series, we first need to define a reasonable value, and then define how far from that value a number must be before we consider it a significant deviation. For the reasonable value, we can use the mean:
The mean is ~4.33.
Next, we need to define the deviation. Let’s use Standard Deviation:
Standard deviation is the square root of the variance, which is the average squared distance from the mean. In this case, the standard deviation is ~3.08.
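As a rough illustration of the two statistics, here is a minimal Python sketch. The series in it is an assumption: it is a hypothetical set of values chosen only because it is consistent with the mean and standard deviation quoted in this lesson. The sketch also shows a detail worth checking in any tool you use: a standard deviation of ~3.08 for this series corresponds to the sample formula (dividing by n - 1), while the population formula (dividing by n) gives a slightly smaller value.

```python
import statistics

# Hypothetical series (an assumption, not taken from the lesson): chosen
# because it matches the quoted mean (~4.33) and standard deviation (~3.08).
series = [2, 3, 5, 2, 3, 12, 5, 3, 4]

# Mean: the sum of the values divided by how many there are.
mean = statistics.mean(series)

# Sample standard deviation: squared distances are divided by n - 1.
sample_sd = statistics.stdev(series)

# Population standard deviation: squared distances are divided by n.
population_sd = statistics.pstdev(series)

print(f"mean = {mean:.2f}")
print(f"sample standard deviation = {sample_sd:.2f}")
print(f"population standard deviation = {population_sd:.2f}")
```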
Now that we have defined a “reasonable” value and a deviation, we can define a range of acceptable values:
The range we defined is one standard deviation on either side of the mean: 4.33 ± 3.08, or roughly 1.25 to 7.4. Any value outside this range is considered an anomaly:
Using the query, we found that the value 12 is outside the range of acceptable values and identified it as an anomaly.
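The same check can be sketched in Python. This is only an illustration under stated assumptions: the series is a hypothetical one that matches the statistics quoted above, and `find_anomalies` and its `n_sd` parameter are names invented for this example rather than part of any library.

```python
import statistics

def find_anomalies(values, n_sd=1.0):
    """Return (lower, upper, anomalies) for the range mean ± n_sd * sd.

    Hypothetical helper for illustration; uses the sample standard deviation.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    lower = mean - n_sd * sd
    upper = mean + n_sd * sd
    # A value is an anomaly when it falls outside the acceptable range.
    anomalies = [v for v in values if v < lower or v > upper]
    return lower, upper, anomalies

# Hypothetical series (an assumption), consistent with mean ~4.33, sd ~3.08.
series = [2, 3, 5, 2, 3, 12, 5, 3, 4]

lower, upper, anomalies = find_anomalies(series)
print(f"acceptable range: {lower:.2f} to {upper:.2f}")
print(f"anomalies: {anomalies}")
```

Making the width of the range a parameter is a design choice: a larger `n_sd` accepts more values as normal, while a smaller one flags more values as anomalies.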